MicroRNAs (miRNAs) play important roles in multiple biological processes and have attracted much 
scientific attention recently. Their expression can be altered by environmental factors (EFs), which are 
associated with many diseases. Identification of the phenotype-genotype relationships among miRNAs, EFs, 
and diseases at the network level will help us to better understand toxicology mechanisms and disease 
etiologies. In this study, we developed a computational systems toxicology framework to predict new 
associations among EFs, miRNAs and diseases by integrating EF structure similarity and disease phenotypic 
similarity. Specifically, three comprehensive bipartite networks: EF-miRNA, EF-disease and 
miRNA-disease associations, were constructed to build predictive models. The areas under the receiver 
operating characteristic curves using 10-fold cross validation ranged from 0.686 to 0.910. Furthermore, we 
successfully inferred novel EF-miRNA-disease networks in two case studies for breast cancer and cigarette 
smoke. Collectively, our methods provide a reliable and useful tool for the study of chemical risk assessment 
and disease etiology involving miRNAs. 

MicroRNA (miRNA) is a newly identified type of small non-coding RNA that downregulates gene
expression at the post-transcriptional level by inhibiting translation of mRNA or degrading
mRNA 1-4 . As important regulators of the expression of at least 60% of all protein-coding genes, miRNA
networks have become an important research field in systems biology 5 . miRNA expression profiles can be altered
by toxic environmental factors (EFs), such as radiation 6 , pollution 7 , cigarette smoke 8 , and others. The gene
networks targeted by miRNAs may change with altered miRNA expression. These changes can ultimately cause
diverse diseases, such as cancer 9 , neurological diseases 10 and cardiovascular diseases 11 . Thus, miRNA networks
bridge the toxicology mechanism gap between EFs and diseases, providing useful information for interpreting EF
toxicity and disease etiology 12-15 . For example, in one study, miR-31 expression in normal respiratory epithelia
and lung cancer cells was induced by cigarette smoke, resulting in lung cancer 16 . In another study, two well-known
endocrine disrupting compounds, bisphenol A (BPA) and dichlorodiphenyltrichloroethane (DDT), altered
the miRNA expression profiles of MCF-7 breast cancer cells, including that of the estrogen-regulated onco-miR-21. This
finding presents the toxicology mechanisms of xenoestrogens and the pathology of breast cancer from a new perspective 17 .
Although investigations of the associations among EFs, miRNAs and diseases are gaining increasing attention,
experimental studies are time-consuming and costly due to the huge number of
EFs available for analysis.

As the amount of experimental data has increased rapidly, computational models provide useful tools for
identifying new human health hazards associated with EFs. Computational methods can be divided into classic
quantitative structure-activity relationships (QSARs) and computational systems toxicology approaches. The
latter have advantages over classic QSAR models, such as the OECD QSAR Toolbox
(http://www.oecd.org/chemicalsafety/risk-assessment/theoecdqsartoolbox.htm) and admetSAR 18 . In our previous study, we developed
predictive toxicogenomics-derived models (PTDMs) to predict chemical-gene-disease associations using the
network-based inference (NBI) algorithm 19 . Other computational systems toxicology approaches have also been
published to study the disease etiologies caused by proteins 20 and chemical metabolism 21 . However, the toxicology
mechanisms of EF exposure and disease etiology remain a major topic of research today 22 . The recent appearance
of miRNAs has provided huge opportunities for the development of computational models from a systems
biology perspective, and computational methods have been developed to predict potential associations in

Figure 1 | Diagram of the computational systems toxicology framework. (a) The original data were collected from the Human MicroRNA Disease Database
and miREnvironment Database, and used to construct three bipartite networks: the EF-miRNA association (EMA), EF-disease association (EDA), and
miRNA-disease association (MDA) networks. (b) Three methods, network-based inference (NBI), EF structure similarity-based inference (ES-SBI) and
disease phenotypic similarity-based inference (DP-SBI), were developed to build the predictive model designated the predictive EF-miRNA-disease
association model (PEMDAM). (c) The PEMDAM was built using the intersection of both of the prioritized lists from NBI and SBI. (d) Network
visualization and analysis. EF: the environmental factor; S_T: the Tanimoto similarity between two EFs; S_S: the phenotypic similarity between two diseases.



miRNA-related networks. Qiu et al. uncovered a number of biological
patterns of EF-miRNA interactions and proposed a computational
model to predict new EF-disease associations 23 . Jiang et al. constructed
cancer-specific networks to identify the biological links
between small molecules and miRNAs 24 . Chen et al. reported a
method named miREFScan to predict disease-related EF-miRNA
associations using a semi-supervised classifier 25 . Currently, there is
still a great need for feasible, effective and efficient models.

In this study, we developed a computational systems toxicology
framework to predict miRNA networks by systematic integration of
EF structure similarity and disease phenotypic similarity. Specifically,
we constructed three high-quality bipartite networks (EF-miRNA,
EF-disease and miRNA-disease associations) to build predictive
computational systems toxicology models. High predictive performance
was achieved in 10-fold cross validation. Furthermore, two case
studies were performed to illustrate the predictive capability of the
constructed framework. Collectively, the developed computational
model provides useful new tools to elucidate the mechanisms of
environmental toxicity and disease etiologies at the miRNA level.

Results 

Overview of the computational systems toxicology framework. 

We proposed a new computational systems toxicology framework 
to predict putative EF-miRNA-disease associations. As shown in 
Figure 1, three bipartite networks: EF-miRNA association (EMA), 
EF-disease association (EDA) and miRNA-disease association 
(MDA), were constructed. The EMA network included 1,770 
associations between 184 EFs and 395 miRNAs, while the MDA 
network consisted of 6,466 associations connecting 569 miRNAs 
and 396 diseases. The EDA network contained 320 associations 
linking 171 EFs and 115 diseases (Table 1). More detailed information
is provided in Supplementary Table S1. Next, we used

Table 1 | Datasets of the known EMAs, MDAs and EDAs used in this study

Dataset   N_e   N_m   N_d   N_a
EMAs      184   395   -     1,770
MDAs      -     569   396   6,466
EDAs      171   -     115   320

EMAs: EF-miRNA associations; MDAs: miRNA-disease associations; EDAs: EF-disease
associations; N_e: the number of EFs; N_m: the number of miRNAs; N_d: the number of diseases; N_a:
the number of associations.



three network-based methods, including network-based inference 
(NBI) 26 , EF structure similarity-based inference (ES-SBI) and 
disease phenotypic similarity-based inference (DP-SBI), to build a 
predictive EF-miRNA-disease association model (PEMDAM). 
Finally, the PEMDAM was validated using 10-fold cross validation 
and applied to two case studies on breast cancer and cigarette smoke. 

Network characteristics of the known EF-miRNA-disease association
network. The MDA network displays the miRNA
signatures of specific diseases, which is helpful for studying the
pathological mechanisms of these diseases. We identified eight
modules ranging in size from 6 to 31 based on the MDA
network using the Cytoscape plugin MCODE 27 (Figure 2). These
modules display the miRNA signatures shared between diseases.
For example, as shown in module 1, two psychiatric
diseases, schizophrenia and autistic disorder, shared mir-15a,
which was confirmed to target genes such as regulator of G-protein
signaling 4 (RGS4), glutamate receptor metabotropic 7
(GRM7), glutamate receptor subunit 3A (GRIN3A) and visinin-like
1 (VSNL1) 28 . Furthermore, the miRNAs from different families were
depicted in various colors, which illustrates that miRNAs in the same
family share the same important seed-pairing region and consequently
tend to have similar functions. The most obvious miRNA
family found is the let-7 family, which has four members in module 1
and six members in module 2. In module 1, the four let-7 members
cooperate with each other in three diseases: myelodysplastic syndromes,
head & neck squamous cell carcinomas and retinoblastomas.
In module 2, all six of the let-7 members play important roles in
inflammation and nasopharyngeal neoplasms. In addition, the
members of the mir-193 family function together in both chronic
atrial fibrillation and myotonic dystrophy, as shown in module 6.
Members of other miRNA families, mir-9, mir-19, mir-29, mir-34, and
mir-181, were also found to cooperate in specific diseases.

In addition, the three classical network parameters connectivity
(K), clustering coefficient (C) and betweenness (B) were calculated to
measure the topological features of the EMA, EDA and MDA networks,
respectively (Supplementary Fig. S1). Most bionetworks are
scale-free networks whose connectivity follows a power-law distribution 29 .
In our bipartite networks, a minority of nodes have high
degrees while the majority have low degrees. The disease with
the highest connectivity is breast cancer, which is associated with 287
miRNAs in the MDA network and 26 EFs in the EMA network. The
most studied EFs are radiation, hypoxia and 17beta-estradiol. The
clustering coefficient measures the local density of links and their
tendency to form clusters or communities of nodes. The average
clustering coefficients in our study ranged from 0.087 to 0.206.
Although the EDA network is smaller than the
MDA network, its component nodes connect closely with each
other, thus their clustering coefficients are relatively high. A node's
betweenness is defined as the fraction of all of the shortest paths
between all nodes in the network that pass through the node. In all
three networks, only a few nodes have high betweenness values while
many nodes have very low betweenness values. Collectively, the
EMA, EDA and MDA networks are similar to other bionetworks;
however, they are relatively sparse and not well defined, which leaves
plenty of room for research and reveals a need to find new methods to
predict missing links in the networks.

Performance of the computational systems toxicology model. 

miRNA-disease association prediction. The prediction of new 
candidate MDAs is the basis for studying individual miRNA roles 
in disease pathogenesis. A comprehensive MDA network supported 
by experimental evidence was collected from the HMDD and 
miREnvironment databases. In the PEMDAM, the predicted list of
new candidate diseases linked to miRNAs was obtained using the NBI
algorithm, while the prediction of new candidate miRNAs linked to
diseases was found by combining NBI with DP-SBI. The prediction 
of putative diseases linked to miRNAs (NBI_Dis2miR) achieved an 
AUC of 0.910. A high AUC of 0.875 was also achieved when 
prioritizing new candidate miRNAs linked to diseases using NBI 
(NBI_miR2Dis) versus 0.810 by DP-SBI (SBI_miR2Dis). These 
results showed the high predictive accuracy of our PEMDAM 
toward the prediction of new candidate MDAs. 

EF-disease association prediction. New EDA predictions could 
enhance our knowledge about how EFs affect our health. To this 
end, known EDA data were extracted from the miREnvironment 
database. Prediction of EDAs involved prioritizing new candidate 
EFs linked to diseases and also prioritizing new candidate diseases 
linked to EFs. When prioritizing new candidate EFs linked to diseases,
NBI and DP-SBI were applied (NBI_EF2Dis, SBI_EF2Dis). In
addition, NBI and ES-SBI were used to predict new candidate diseases
linked to EFs (NBI_Dis2EF, SBI_Dis2EF). Heat maps of EF
structure similarity and disease phenotypic similarity are given in 
Supplementary Figure S2. AUC values of 0.789, 0.686, 0.827, and 
0.787 were obtained for NBI_EF2Dis, NBI_Dis2EF, SBI_EF2Dis, 
and SBI_Dis2EF, respectively. As shown in Figure 3, integrating EF
structure similarity and disease phenotypic similarity with the NBI
algorithm improved the performance of the PEMDAM.

EF-miRNA association prediction. Carcinogens and drugs are two 
major types of EFs. Prediction of new EMAs will help to understand 
the underlying mechanisms of xenobiotic toxicity. The PEMDAM 
was built based on a known EF-miRNA bipartite network collected 
from the miREnvironment database. The prioritization of new candidate
EFs linked to miRNAs was obtained by NBI (NBI_EF2miR),
while the prediction of new candidate miRNAs linked to EFs was 
found by combining NBI (NBI_miR2EF) and ES-SBI (SBI_miR2EF). 
NBI_EF2miR achieved an AUC of 0.886, and the prioritization of 
new candidate miRNAs linked to EFs obtained an AUC of 0.787 by 
NBI, and an AUC of 0.705 by SBI. Collectively, our PEMDAM was 
verified to be reliable for predicting new candidate EMAs. 

Case study 1: discovery of new risks for breast cancer. Breast cancer
is the most common neoplasm in women and caused 458,503 deaths
worldwide in 2008 30 . Moreover, breast cancer is the
most studied disease at the miRNA level 31 , having the highest
degrees in both the EMA and MDA networks. The dataset used to
build this predictive model contained >300 associations related to
breast cancer supported by ~300 experimental documents.
Prioritizing new potent EFs and miRNAs linked to breast cancer 
would improve our knowledge of breast cancer etiology. Thus, the 
predicted lists for breast cancer were extracted from the final 
prioritized lists from our PEMDAM as a case study, and a sub- 
network was constructed with Cytoscape for network analysis. 

Six new candidate EFs associated with breast cancer were predicted
based on the common top 10 candidates using both NBI
and DP-SBI methods. Interestingly, all of the predicted EFs (6/6,
100%) related to breast cancer were found to be supported by experimental
evidence in the literature (Supplementary Table S2). Due to
research bias, these EFs have not been studied with respect to miRNA
expression changes related to breast cancer. However, this information
can be discovered using the PEMDAM. Information about the
associated miRNAs of the six new candidate EFs prioritized for
breast cancer was extracted from known networks. In total, 40
potential miRNAs for breast cancer were obtained through utilizing
the common candidates of the top 50 lists by both NBI and DP-SBI.
Among the 40 new candidate miRNAs prioritized for breast cancer,
39 (97.5%) miRNAs were validated by databases or newly published
literature (Supplementary Table S3). For these validated miRNAs,
the EFs that can alter their expression were also extracted from the
entire network. The putative lists shown in Supplementary Tables S2
and S3 are very promising for further study. For example, radiation
may alter the expression of 32 breast cancer related miRNAs, and
mir-181b may be another miRNA that plays an important role in the
tobacco related pathology of breast cancer. Figure 4 shows a global
breast cancer network constructed with known and predicted EMAs,
MDAs and EDAs. The network includes 32 EFs and 327 miRNAs
related to breast cancer. In the center of the network, 219 miRNAs
are specific for EFs; thus, these miRNAs may be developed as biomarkers
of breast cancer for people who are exposed to these toxic
EFs. Although the miRNAs in the periphery are not defined to be
associated with specific EFs, they are quite important for understanding
the pathology of breast cancer.

Interestingly, some of the EFs are drugs. Studies about associations
among drugs, miRNAs and diseases will help to increase our knowledge
about polypharmacology and personalized medicine. Breast
cancers were classified into two major subtypes: luminal and basal
subtypes. Here, we tried to make predictions for drug-disease associations
based on the above two breast cancer subtypes. Five known
associations among subtypes and specific drugs were collected from
the published literature 32,33 and added into our computational framework.
Predicted lists were obtained from the top 10 lists using the NBI
algorithm (Supplementary Table S4). As there are not enough known
data about subtypes, the predicted lists here need more experiments for
validation. With sufficient compound-disease associations based on
specific disease subtypes collected, our computational approaches
will perform better.

Collectively, the predictive computational systems toxicology 
model developed here is valuable and can reliably predict potential 
new EF exposure risks and miRNA biomarkers to help increase our 
understanding of breast cancer etiology. Moreover, our computational
program showed predictive capability for subtype-specific
drug-disease associations.

Case study 2: discovery of new hazards from cigarette smoke. 

Approximately 1.3 billion people smoke cigarettes, which results in 
5 million preventable deaths per year 34 . Cigarette smoke contains 
many toxic components and has been found to alter a number of 
genetic factors, including miRNAs. These miRNAs may be used as 
biomarkers for the diagnosis and progression of the diseases of 
tobacco smokers 35 and help to elucidate the biological mechanisms 
of tobacco toxicity. In this study, two of the major carcinogens in
cigarettes, nicotine and benzo(a)pyrene (BaP), were included in
addition to tobacco. In total, 58 miRNAs were found to be
experimentally altered by cigarette smoke and contributed to the
pathology of seven smoking-related diseases. Among them, mir-128
was strongly affected by cigarette smoke and played an
important role in the host response by regulating the target gene
MAFG 36 . miR-31 was verified as an oncomiR during lung cancer
progression and its expression can be induced by cigarette
smoke 16 . miRNA expression changes were also related to maternal
cigarette use during pregnancy and poor fetal outcome 37 . An
increasing amount of research has been focused on the changes in
miRNA expression caused by tobacco smoke.

In order to further examine how tobacco influences human health
at the miRNA level, prediction of new candidate miRNAs and new
disease risks for tobacco use was performed using the PEMDAM.
Because tobacco is a mixture without a specific structure, the predicted
lists were obtained only by NBI. Predicted lists for nicotine
and benzo(a)pyrene were generated by both the NBI and ES-SBI
methods. Supplementary Tables S5 and S6 list the top 5 miRNAs
and top 5 diseases for tobacco prioritized by NBI. In addition, 5
potential miRNAs and 5 potential diseases were prioritized for nicotine,
while 4 new candidate miRNAs and 4 new candidate diseases
were predicted for benzo(a)pyrene by the common top 10 lists in the
NBI and ES-SBI methods. Related diseases were extracted from the
whole network for the potential miRNAs that were predicted to be
altered by cigarette smoke. Meanwhile, the known MDAs were also
extracted from our model for the candidate diseases prioritized for
cigarette use. Collectively, inferring new miRNA biomarkers could
improve our understanding of the relationships between cigarette
smoke and smoking-related diseases. The predicted associations
among tobacco smoke, miRNAs and diseases (Supplementary
Tables S5 and S6) provide potential candidates for further experimental
validation. For example, tobacco was predicted to alter the
expression of mir-155, mir-221, let-7a-1 and mir-126, which play
important roles in lung neoplasm pathology. Although there are
some newly published studies 8,38 for tobacco smoke, there are still
not enough data to validate the performance of the PEMDAM. The
entire network of tobacco smoke (Figure 5) was constructed with the
known and predicted EMAs, MDAs and EDAs. This network contains
58 miRNAs and 7 diseases, which were confirmed to be associated
with cigarette smoke by experimental studies. Fourteen predicted
EMAs and 14 prioritized EDAs related to cigarette smoke were also
included.

Discussion 

miRNA network analysis will open up new avenues for the understanding
of environmental toxicity and disease etiology. In addition,
miRNA networks have several advantages over other types of bionetworks.
miRNAs are located upstream of gene signal transduction;
thus, changes in miRNA expression are more sensitive and occur
before changes in proteins. Furthermore, because miRNAs can be
easily detected in circulation, they are suitable as sensitive indicators 
of toxic exposure or novel biomarkers for the prevention, diagnosis 
and progression of EF-related diseases 39 . 

Our predictive computational systems toxicology model obtained a
high accuracy in prioritizing the potential associations among EFs,
miRNAs and diseases. This high performance is likely due to three
factors: the data quality, the design of the algorithm and the workflow
strategy. Firstly, the data used to build the predictive model were
obtained from highly reliable databases and supported by experimental
data 40,41 . In network analysis, including topological features
and modules, it is necessary not only to have an overall understanding
of the dataset used but also to ensure that these known networks
conform to the inherent nature of bionetworks, which are small-world 42
and scale-free 29 . These network topological characteristics are
of great importance for the algorithms we used. Secondly, the NBI
and SBI algorithms used in this paper were well defined and have
already been proven to be successful for predicting drug-target
interactions 26,43 and chemical-gene-disease associations 19 . Only two
models were needed to predict the associations in one bipartite
bionetwork, thus the computational workload was greatly reduced. Last but
not least, the PEMDAM has the advantages of both NBI and SBI
because the final prediction results were obtained by utilizing the
common lists of both NBI and SBI. For NBI, only the network topology
structure similarity was needed, which was easily obtained,
while SBI was only applied when specific similarities, such as structural
similarity and phenotypic similarity, were available. However, SBI
performed better than NBI in small networks, such as the EDA network.
Thus, using the common prioritized lists made the predicted results
more reliable than using a single algorithm.

There are some limitations and room for improvement in our
current methods. First, the present model can only predict new associations
among known EFs, miRNAs and diseases. It cannot make
predictions for new EFs, miRNAs or diseases that have no
known association information in the training set.
This could be improved by adding similarities to homogeneous
nodes in a bipartite network. Based on its similarity to other nodes,
the initial resource could be defined to include nodes without known
links. Furthermore, our methods focused on nodes and their relationships
in bipartite networks. Thus, it was a simplified model that
ignored detailed mechanisms of interaction, which differs from real
and complicated biological systems. EFs alter miRNA expression in
directional ways, positively or negatively. There have also been
inconsistencies in miRNA expression changes under the same
experimental conditions. For example, in MCF-7 (ER+) breast cancer
cells, oncomiR-21 was found to be down-regulated by estradiol 44,45
in one study, but was found to be up-regulated by estradiol
in another 46 . Expression profiles of the same miRNA can also vary
across different samples of the same disease. As the underlying
mechanisms are revealed, a directional network of interactions
among EFs, miRNAs and diseases will be worth considering.
Finally, there is also room to improve our algorithm in handling
small networks or sub-networks. The similarity of miRNAs, for
example, their functional similarity 47 , could be integrated into SBI.



All of the methods applied in this paper are data-driven
approaches that depend on the quantity and quality of the evaluation
datasets 48 for good performance. Currently, the known
information about miRNA networks, especially involving environmental
toxicity, is notably sparser than other networks. As more
experiments are carried out, there will be enough data for the
external validation and literature verification of further case studies.
It will then be possible to compare different predicted miRNA
results using various computational programs 49 . As the experimental
dataset becomes enriched, computational systems toxicology
programs will perform better, which in turn will promote the development of
experimental studies. We generated a comprehensive prediction
list, the 'PEMDAM lists', that includes all of the potential
EMAs, EDAs and MDAs found by our computational program.
Researchers interested in EF-miRNA-disease associations can
download the profile for further experimental validation
(www.lmmd.org/database/pemdam).

Methods

Construction of the miRNA networks. Data preparation. Three association datasets,
EMA, EDA and MDA, were collected from the miREnvironment database 40 
(September, 2012) and the Human MicroRNA Disease Database (HMDD) 41 
(September, 2012). Only data tested on humans was kept. Because the same EFs, 
diseases or miRNAs might have different names in the databases, all of the EF and 
disease terms were annotated with the most commonly used vocabularies of the 
Unified Medical Subject Headings (MeSH) 50 , and the miRNAs were named according 
to miRBase 51 . After removing duplicated data, the remaining data were integrated to 
construct the network. 



graph (more details are given in Wang et al. 47 ). Formulas (3) and (4) describe the
predicted scores of the unknown EDAs and MDAs, respectively, where S_S(d_j, d_l)
denotes the phenotypic similarity between two diseases d_j and d_l, and a_il represents the
adjacency matrix of N(E,D,A) in E^D_ij and of N(M,D,A) in M^D_ij.

$$E^{D}_{ij} = \frac{\sum_{l=1,\,l\neq j}^{m} S_S(d_j,d_l)\,a_{il}}{\sum_{l=1,\,l\neq j}^{m} S_S(d_j,d_l)} \tag{3}$$



Network construction. The complete network of EFs, miRNAs and diseases was
transformed into three bipartite networks: EMA, EDA and MDA. The three networks
were further transformed into quantitatively descriptive matrices. The EF set was
denoted as E = {e_1, e_2, ..., e_n}, while M = {m_1, m_2, ..., m_n} and D = {d_1, d_2, ..., d_n}
represented the miRNA and disease sets, respectively. The EMA bipartite pairs were
then represented as N(E,M,A), where A = {a_ij: e_i ∈ E, m_j ∈ M}, the EDA network pairs
were represented as N(E,D,A), where A = {a_ij: e_i ∈ E, d_j ∈ D}, and the MDA network
pairs were represented as N(M,D,A), where A = {a_ij: m_i ∈ M, d_j ∈ D}. In this way, the
EMA, EDA and MDA bipartite networks were represented as n × m adjacency matrices,
where a_ij = 1 if direct experimental data exists in the above two databases, and 0
otherwise.
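
The matrix representation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `build_adjacency` and the toy labels (`EF1`, `miR-a`, ...) are assumptions introduced here and are not records from the miREnvironment database or HMDD.

```python
def build_adjacency(pairs):
    """Return (row_labels, col_labels, matrix) for a list of (row, column) associations."""
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    row_index = {r: i for i, r in enumerate(rows)}
    col_index = {c: j for j, c in enumerate(cols)}
    matrix = [[0] * len(cols) for _ in rows]
    for r, c in pairs:
        # a_ij = 1 when an association between row node i and column node j is known
        matrix[row_index[r]][col_index[c]] = 1
    return rows, cols, matrix


# Hypothetical EF-miRNA pairs, used only to show the shape of the result.
toy_ema = [("EF1", "miR-a"), ("EF1", "miR-b"), ("EF2", "miR-b"), ("EF3", "miR-c")]
efs, mirnas, A = build_adjacency(toy_ema)
for ef, row in zip(efs, A):
    print(ef, row)
```

The same helper would apply unchanged to EF-disease or miRNA-disease pairs, since only the row and column labels differ.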

Measurement of the network topology. In order to gain a full understanding of the 
constructed networks, the Cytoscape plugin MCODE 27 was applied to define the 
modules in the MDA network, and NetworkX (http://networkx.lanl.gov/, version 
1.8.1) was used to calculate three classical topological features, connectivity (k), 
clustering coefficient (C) and betweenness (B), for the EMA, EDA and MDA 
networks. 
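
As a rough illustration of two of these topological features, the sketch below computes connectivity (degree) and betweenness with the standard library only. It is a stand-in for the NetworkX calculation named above, not a reproduction of it: the function names and toy labels are hypothetical, the betweenness is returned as an unnormalised count of shortest paths rather than a fraction, and the clustering coefficient is omitted.

```python
from collections import deque


def to_graph(pairs):
    """Build an undirected adjacency dict from bipartite association pairs."""
    adj = {}
    for u, v in pairs:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def connectivity(adj):
    """Degree of every node."""
    return {node: len(neighbors) for node, neighbors in adj.items()}


def betweenness(adj):
    """Unnormalised shortest-path betweenness (Brandes' algorithm, unweighted graph)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate dependencies, farthest nodes first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # every unordered pair of endpoints was counted twice
    return {v: b / 2 for v, b in bc.items()}


graph = to_graph([("EF1", "miR-a"), ("EF1", "miR-b"), ("EF2", "miR-b")])
k = connectivity(graph)
b = betweenness(graph)
```

In this toy chain (miR-a, EF1, miR-b, EF2), the two inner nodes each lie on two shortest paths between other nodes, while the end nodes lie on none.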

Method development. Network-based inference (NBI). Network-based inference is
an algorithm that allocates known initial resources to obtain predictive lists. Figure 1
shows a simple EMA example to illustrate how to use this network-based inference
algorithm to prioritize unknown miRNAs linked to EFs. The initial resources for a
given EF e_i in the bipartite network N(E,M,A) are located in the miRNAs that are
associated with e_i. Each miRNA distributes its resources equally among all of its neighbors, and they
immediately redistribute these resources to every neighboring miRNA. Finally, the
miRNAs that are not connected with e_i are assigned the end resources, which is their
score. In theory, the higher the score a candidate miRNA gets, the more likely it is to be
associated with e_i. The initial resource a_ij between e_i (the yellow triangle) and m_j
(the green circle) was found as follows. Denote F_0 (n × m) as the initial resource matrix,
with F_0(i,j) = a_ij; R (n × n) as the total resources (degrees) of each EF, with

$$R = \mathrm{diag}\Big(\sum_{j=1}^{m} a_{1j},\ \sum_{j=1}^{m} a_{2j},\ \ldots,\ \sum_{j=1}^{m} a_{nj}\Big);$$

and H (m × m) as the total resources of each miRNA, with

$$H = \mathrm{diag}\Big(\sum_{i=1}^{n} a_{i1},\ \sum_{i=1}^{n} a_{i2},\ \ldots,\ \sum_{i=1}^{n} a_{im}\Big).$$

The resource matrix was then obtained as F_1 (n × m), where

$$F_1 = F_0 W_{m\times m} \quad\text{or}\quad F_1 = F_0^{T} W_{n\times n},$$

and the transfer matrix is

$$W_{m\times m} = (F_0 H^{-1})^{T}(R^{-1}F_0) \quad\text{or}\quad W_{n\times n} = (R^{-1}F_0)(F_0 H^{-1})^{T}.$$

Mathematically, an algorithm to predict other associations among the EFs,
miRNAs and diseases in the EF-miRNA, EF-disease and miRNA-disease bipartite
networks can be similarly deduced.
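
The two-step resource diffusion can be illustrated with a small pure-Python sketch of F_1 = F_0 W with W = (F_0 H^{-1})^T (R^{-1} F_0). It is a sketch under stated assumptions rather than the authors' implementation: rows of F_0 are taken to be EFs and columns miRNAs, R and H are handled as the row sums and column sums of F_0, and the function name `nbi_scores` and the toy matrix are hypothetical.

```python
def nbi_scores(F0):
    """Return F1 = F0 * W for a 0/1 bipartite adjacency matrix F0 (rows x columns)."""
    n, m = len(F0), len(F0[0])
    row_deg = [sum(row) for row in F0]                               # diagonal of R
    col_deg = [sum(F0[i][j] for i in range(n)) for j in range(m)]    # diagonal of H
    # W[j][l] = (1 / col_deg[j]) * sum_i F0[i][j] * F0[i][l] / row_deg[i]
    W = [[0.0] * m for _ in range(m)]
    for j in range(m):
        if col_deg[j] == 0:
            continue  # an isolated column node has no resource to pass on
        for l in range(m):
            W[j][l] = sum(
                F0[i][j] * F0[i][l] / row_deg[i] for i in range(n) if row_deg[i]
            ) / col_deg[j]
    # Final resources: each row node's initial resources pushed through W.
    return [
        [sum(F0[i][j] * W[j][l] for j in range(m)) for l in range(m)]
        for i in range(n)
    ]


# Toy network: 2 row nodes (EFs) and 3 column nodes (miRNAs).
F0 = [[1, 1, 0],
      [0, 1, 1]]
F1 = nbi_scores(F0)
```

For the first row node, the unlinked third column receives a score of 0.25, and the total resource in each row (2 units here) is conserved by the diffusion.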

EF structure similarity-based inference (ES-SBI). The hypothesis underlying this
method is that if an EF e_i is associated with miRNAs or diseases by experimental
evidence, then other EFs similar to e_i tend to be linked with these e_i-associated
miRNAs or diseases. For an unknown EMA, the linkage between e_i and m_j is determined
by the predictive scoring function in formula (1). The association-predicting
score for unknown EDAs is shown in formula (2).

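$$M^{E}_{ij} = \frac{\sum_{l=1,\,l\neq i}^{n} S_T(e_i,e_l)\,a_{lj}}{\sum_{l=1,\,l\neq i}^{n} S_T(e_i,e_l)} \tag{1}$$

$$D^{E}_{ij} = \frac{\sum_{l=1,\,l\neq i}^{n} S_T(e_i,e_l)\,a_{lj}}{\sum_{l=1,\,l\neq i}^{n} S_T(e_i,e_l)} \tag{2}$$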

S_T(e_i, e_l) indicates the Tanimoto similarity of the 2D chemical structures between
EFs e_i and e_l. Detailed information about Tanimoto similarity can be found in
Willett's work 52 . a_lj is the adjacency matrix of N(E,M,A) in M^E_ij, and of N(E,D,A) in D^E_ij. The
structures of the EFs were transformed to MACCS keys using the OpenBabel software 53 .
However, a small portion of the EFs could not be identified with structures, for
example, pathogens, radiation and pollutants. The prediction lists for these cases were
generated only by NBI.
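
A minimal sketch of this similarity-weighted scoring follows. It assumes that each fingerprint is available as a set of on-bit indices (the toy sets below are placeholders, not real MACCS keys), that the sums run over the other EFs only, and that an EF with no similar neighbor simply receives a score of zero; the function names are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def es_sbi_scores(A, S):
    """score[i][j] = sum_{l != i} S[i][l] * A[l][j] / sum_{l != i} S[i][l]."""
    n, m = len(A), len(A[0])
    scores = [[0.0] * m for _ in range(n)]
    for i in range(n):
        denom = sum(S[i][l] for l in range(n) if l != i)
        if denom == 0:
            continue  # no similar EF to borrow associations from
        for j in range(m):
            scores[i][j] = sum(S[i][l] * A[l][j] for l in range(n) if l != i) / denom
    return scores


# Three toy EFs (rows) and two toy miRNAs or diseases (columns).
fingerprints = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
S = [[tanimoto(a, b) for b in fingerprints] for a in fingerprints]
A = [[1, 0],
     [1, 1],
     [0, 1]]
scores = es_sbi_scores(A, S)
```

Here the first EF has no known link to the second column, but its only similar neighbor (the second EF) does, so the candidate link scores 1.0. The third EF shares no bits with the others and scores 0.0 everywhere.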

Disease phenotypic similarity-based inference (DP-SBI). This method was designed
based on the hypothesis that diseases in the same phenotypic classification tend to be 
associated with similar EFs and miRNAs. The phenotypic similarity of two diseases 
was measured by finding their relative positions in the MeSH disease directed acyclic 



$$M^{D}_{ij} = \frac{\sum_{l=1,\,l\neq j}^{m} S_S(d_j,d_l)\,a_{il}}{\sum_{l=1,\,l\neq j}^{m} S_S(d_j,d_l)} \tag{4}$$
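
The DP-SBI score can be sketched in the same way as a similarity-weighted average, this time taken over the other diseases (columns). The sketch assumes the phenotypic similarity matrix is already available; the MeSH-based similarity calculation itself is not reproduced, and the similarity values, matrix and function name below are hypothetical.

```python
def dp_sbi_scores(A, S):
    """score[i][j] = sum_{l != j} S[j][l] * A[i][l] / sum_{l != j} S[j][l]."""
    n, m = len(A), len(A[0])
    scores = [[0.0] * m for _ in range(n)]
    for j in range(m):
        denom = sum(S[j][l] for l in range(m) if l != j)
        if denom == 0:
            continue  # disease j has no phenotypically similar disease
        for i in range(n):
            scores[i][j] = sum(S[j][l] * A[i][l] for l in range(m) if l != j) / denom
    return scores


# Toy phenotypic similarities between three diseases (symmetric, made-up values).
S = [[1.0, 0.8, 0.2],
     [0.8, 1.0, 0.4],
     [0.2, 0.4, 1.0]]
# Rows: two EFs (or miRNAs); columns: the three diseases.
A = [[1, 0, 0],
     [0, 1, 1]]
scores = dp_sbi_scores(A, S)
```

The first row node is linked only to the first disease, so its score for the second disease (similarity 0.8 to the first) is higher than its score for the third (similarity 0.2).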



Performance assessment. Performance of all the models was evaluated by 10-fold 
cross validation. For each dataset, all links in the EMA, EDA and MDA networks were 
randomly divided into 10 parts of equal size. Each part was used as the validation set 
in turn, while the remaining nine parts served as the training set. To eliminate the 
error caused by separating datasets, all of the results were produced by a simulation of 
100 independent tests, and the receiver operating characteristic (ROC) curves were 
used. Due to random partitioning of the data, some EFs, miRNAs or diseases only 
existed in the test set without seed information in the training set. Links among these 
nodes were not considered in the performance assessment. 
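
The evaluation protocol can be sketched as follows. This is an illustrative outline only: it shows a single 10-fold partition (the 100 repetitions and the exclusion of nodes without seed information are not shown), it computes the AUC as the probability that a held-out link outscores an unknown pair, and the function names and toy inputs are assumptions.

```python
import random


def ten_fold_splits(links, k=10, seed=0):
    """Yield (train, test) partitions of the known links, one per fold."""
    shuffled = list(links)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]   # k parts of (near-)equal size
    for i in range(k):
        train = [link for f, fold in enumerate(folds) if f != i for link in fold]
        yield train, folds[i]


def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos_scores
        for q in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


# Toy data: 20 hypothetical links and made-up prediction scores.
links = [(i, j) for i in range(5) for j in range(4)]
splits = list(ten_fold_splits(links))
example_auc = auc([0.9, 0.8], [0.1, 0.85])
```

In a full run, the model would be rebuilt on each `train` set, the held-out `test` links would be scored against the unknown pairs, and the resulting AUC values would be averaged over folds and repetitions.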

Network visualization and analysis. The final predicted associations among EFs, 
miRNAs and diseases were obtained by the common prioritized lists of NBI and SBI. 
To visualize the relationships among the EFs, miRNAs and diseases, networks were 
constructed using Cytoscape 3.0 54 with the known associations generated by data 
integration and the predicted links found by the PEMDAM. The associations 
regarding breast cancer and cigarette smoke were then extracted to build the 
subnetworks during the case study analysis. 
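
Taking the common candidates of two prioritized lists amounts to intersecting the top-k predictions of NBI and SBI. The sketch below assumes each method returns a score per candidate and that known associations are excluded before ranking; the function names, scores and labels are hypothetical.

```python
def top_k(scores, known, k):
    """Names of the k highest-scoring candidates that are not already known links."""
    candidates = [(score, name) for name, score in scores.items() if name not in known]
    candidates.sort(key=lambda item: (-item[0], item[1]))  # best score first
    return [name for _, name in candidates[:k]]


def common_candidates(nbi, sbi, known, k=10):
    """Candidates that appear in the top-k lists of both methods, in NBI order."""
    sbi_top = set(top_k(sbi, known, k))
    return [name for name in top_k(nbi, known, k) if name in sbi_top]


# Made-up scores for one disease; "miR-d" is treated as an already known association.
nbi = {"miR-a": 0.9, "miR-b": 0.7, "miR-c": 0.4, "miR-d": 0.1}
sbi = {"miR-a": 0.2, "miR-b": 0.8, "miR-c": 0.6, "miR-d": 0.9}
shared = common_candidates(nbi, sbi, known={"miR-d"}, k=2)
```

With k = 2, NBI ranks miR-a and miR-b first while SBI ranks miR-b and miR-c first, so only miR-b is retained.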